5 research outputs found
Built to Last or Built Too Fast? Evaluating Prediction Models for Build Times
Automated builds are integral to the Continuous Integration (CI) software
development practice. In CI, developers are encouraged to integrate early and
often. However, long build times can be an issue when integrations are
frequent. This research focuses on finding a balance between integrating often
and keeping developers productive. We propose and analyze models that can
predict the build time of a job. Such models can help developers better
manage their time and tasks. Project managers can also explore different
factors to determine the best setup for a build job that will keep the build
wait time to an acceptable level. Software organizations transitioning to CI
practices can use the predictive models to anticipate build times before CI is
implemented. The research community can modify our predictive models to further
understand the factors and relationships affecting build times.
Comment: A 4-page version was published in the Proceedings of the IEEE/ACM 14th International Conference on Mining Software Repositories (MSR 2017), pages 487-490.
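
To make the abstract's idea concrete, here is a minimal sketch of a build-time prediction model. The features, the synthetic data, and the choice of scikit-learn's RandomForestRegressor are illustrative assumptions, not the paper's actual setup.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    # Hypothetical job-level features: lines changed, number of commits,
    # test count, and parallelism level. All values here are synthetic.
    rng = np.random.default_rng(0)
    X = rng.random((500, 4))
    # Synthetic target: build time in seconds, loosely driven by the features.
    y = 60 + 300 * X[:, 0] + 120 * X[:, 2] + rng.normal(0, 10, 500)

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)

    # A predicted build time lets a developer decide whether to wait for the
    # build or switch tasks in the meantime.
    print("predicted seconds:", model.predict(X_test[:3]))
    print("R^2 on held-out jobs:", model.score(X_test, y_test))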
On utilizing an enhanced object partitioning scheme to optimize self-organizing lists-on-lists
Author's accepted manuscript. This is a post-peer-review, pre-copyedit version of an article published in Evolving Systems. The final authenticated version is available online at: http://dx.doi.org/10.1007/s12530-020-09327-4.
Collaborative Cloud Computing Framework for Health Data with Open Source Technologies
The proliferation of sensor technologies and advancements in data collection
methods have enabled the accumulation of very large amounts of data.
Increasingly, these datasets are considered for scientific research. However,
designing a system architecture that achieves high performance in terms of
parallelization, query processing time, and aggregation of heterogeneous data
types (e.g., time series, images, and structured data), while keeping
scientific research reproducible, remains a major challenge. This is
especially true for health sciences research, where systems must be i) easy
to use, with the flexibility to manipulate data at the most granular level,
ii) agnostic of the programming language kernel, iii) scalable, and iv)
compliant with the HIPAA privacy law. In this paper, we review the existing
literature on big data systems for scientific research in the health sciences
and identify gaps in the current system landscape. We propose a novel
architecture for a software-hardware-data ecosystem built on open source
technologies such as Apache Hadoop, Kubernetes, and JupyterHub in a
distributed environment. We also evaluate the system using a large clinical
dataset of 69M patients.
Comment: This paper was accepted at ACM-BCB 202
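
As a minimal sketch of the parallel-aggregation idea, the snippet below streams one partition of a large patient-level file in chunks, the way a single worker in a distributed Hadoop/JupyterHub-style deployment might. The file name and column name are hypothetical, and this is an illustration under those assumptions, not the paper's implementation.

    import pandas as pd

    # Hypothetical partition of a large clinical dataset; the file and the
    # "diagnosis_code" column are assumed for illustration.
    PARTITION = "patients_partition_00.csv"

    def aggregate_partition(path, chunksize=100_000):
        """Stream one partition and count patients per diagnosis code
        without loading the whole file into memory."""
        counts = {}
        for chunk in pd.read_csv(path, chunksize=chunksize):
            for dx, n in chunk["diagnosis_code"].value_counts().items():
                counts[dx] = counts.get(dx, 0) + n
        return counts

    # Partial results from many such workers can then be merged by a
    # coordinating node, in a map-reduce style of aggregation.
    if __name__ == "__main__":
        print(aggregate_partition(PARTITION))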
On utilizing an enhanced object partitioning scheme to optimize self-organizing lists-on-lists
With the advent of “Big Data” as a field in and of itself, at least three fundamentally new questions have emerged: the Artificial Intelligence (AI)-based algorithms required, the hardware to process the data, and the methods to store and access the data efficiently. This paper presents some novel schemes for the last of these three areas. (The work of the second author was partially supported by NSERC, the Natural Sciences and Engineering Research Council of Canada. We are very grateful for the feedback from the anonymous Referees of the original submission. Their input significantly improved the quality of this final version.) Thousands of papers have been written about the algorithms themselves, and the hardware vendors are scrambling for market share. However, the question of how to store, manage, and access data, which has always been central to Computer Science, is even more pertinent in these days when megabytes of data are generated every second.

This paper considers the problem of minimizing the retrieval cost using the most fundamental data structure, i.e., the Singly-Linked List (SLL). We consider an SLL whose elements are accessed by a Non-stationary Environment (NSE) exhibiting the so-called “Locality of Reference”. We propose a solution by designing an “Adaptive” Data Structure (ADS), created as a composite of hierarchical data “sub”-structures that constitute the overall data structure. (In this paper, the primitive data structure is the SLL; however, these concepts can be extended to more complicated data structures.) In particular, we design hierarchical Lists-on-Lists (LOLs) by assembling SLLs into a hierarchical scheme, yielding SLLs on SLLs (SLLs-on-SLLs) comprising an outer list-context and sublist contexts. The goal is that elements that are more likely to be accessed together are grouped within the same sub-context, while the sublists themselves are moved “en masse” towards the head of the list-context to minimize the overall access cost. This move is carried out using the de facto list re-organization schemes, i.e., the Move-To-Front (MTF) and Transposition (TR) rules.

To achieve the clustering of elements within the sublists, we invoke the Object Migration Automaton (OMA) family of reinforcement schemes from the theory of Learning Automata (LA). These capture the probabilistic dependence of the elements in the data structure as it receives query accesses from the Environment. We show that SLLs-on-SLLs augmented with the Enhanced OMA (EOMA) minimize the retrieval cost for elements in NSEs and are superior to the stand-alone MTF and TR schemes, as well as to the OMA-augmented SLLs-on-SLLs operating in such Environments.
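
To make the two list re-organization rules concrete, here is a minimal sketch of Move-To-Front and Transposition applied on each access. A plain Python list stands in for the singly linked list for brevity; the hierarchical sublist and OMA machinery of the paper are not reproduced here.

    def access_mtf(lst, key):
        """Move-To-Front: on access, move the queried element to the head."""
        i = lst.index(key)           # sequential scan costs i + 1 comparisons
        lst.insert(0, lst.pop(i))
        return i + 1                 # retrieval cost of this access

    def access_tr(lst, key):
        """Transposition: on access, swap the queried element one step forward."""
        i = lst.index(key)
        if i > 0:
            lst[i - 1], lst[i] = lst[i], lst[i - 1]
        return i + 1

    # Repeated accesses to a small "locality" of keys drift those keys toward
    # the head, reducing the average retrieval cost over time.
    items = list("abcdefgh")
    for k in "ggfgfg":
        access_mtf(items, k)
    print(items)  # the frequently accessed 'g' and 'f' now sit at the head

MTF adapts quickly but overreacts to one-off accesses, while TR is more conservative; the paper's hierarchical scheme moves whole sublists "en masse" using these same rules at the outer level.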